Refactor Wan Model Training & Add Wan-VACE Training Support by ninatu · Pull Request #352 · AI-Hypercomputer/maxdiffusion

ninatu · 2026-03-11T14:35:33Z

This PR introduces several improvements and fixes to the Wan model training, as well as adds support for training Wan-VACE models.

Key changes include:

Bug fixes:
- Resolved training mode bug when dropout > 0 (e.g., ensured rngs parameter is passed to layer_forward for gradient checkpointing with dropout)
- Fixed prepare_sample_fn usage for 'tfrecord' dataset type.
- Addressed checkpoint loading issues with larger TPU slices and different topologies for Wan 2.1.
- Corrected timestep sampling for continuous sampling
Config updates:
- Ensured adam_weight_decay is a float.
- Added tensorboard_dir parameter for logging.
- Now uses config.learning_rate instead of a hardcoded value.
- Set default dropout to 0.0 in WAN configs (instead of 0.1).
Wan-VACE Support:
- Refactoring: Common training components (initialization, scheduler, TFLOPs calculation, training/eval loops) have been abstracted into a new BaseWanTrainer ABC to improve code structure and reusability.
- Added new scripts (train_wan_vace.py), trainer (wan_vace_trainer.py), and checkpointing logic (wan_vace_checkpointing_2_1.py) to enable training of WAN-VACE models.
New Features:
- Introduced config.disable_training_weights to optionally disable mid-point loss weighting.
- Added logging for max_grad_norm and max_abs_grad.

- Ensure `adam_weight_decay` is a float. - Add `tensorboard_dir` parameter for logging. Co-authored-by: martinarroyo <martinarroyo@google.com>

- Conditionally apply dropout only when rate > 0. - Use standard list initialization. - Add rngs parameter to layer_forward (essential for gradient checkpointing with dropout > 0) Co-authored-by: martinarroyo <martinarroyo@google.com>

Replaces the hardcoded learning rate in the optimizer creation with the value from `config.learning_rate`. Co-authored-by: martinarroyo <martinarroyo@google.com>

Co-authored-by: martinarroyo <martinarroyo@google.com>

…lices and different topologies Co-authored-by: martinarroyo <martinarroyo@google.com>

…ling and introduce disable_training_weights, add max_grad_norm and max_abs_grad logging. - Switched timestamp sampling from discrete to continuous. - Add max_grad_norm and max_abs_grad calculation and logging. - Introduced `config.disable_training_weights` to optionally disable mid-point loss weighting. Co-authored-by: martinarroyo <martinarroyo@google.com>

The following key functionalities have been moved from WanTrainer to the new `BaseWanTrainer` ABC: - Initialization and config handling - Scheduler creation - TFLOPs calculation - Core training and evaluation loops (`start_training`, `training_loop`, `eval`) - Abstract methods for checkpointer, data loading, sharding, and step functions. Co-authored-by: martinarroyo <martinarroyo@google.com>

Introduces training support for WAN-VACE models. New files: - train_wan_vace.py: Main training script. - wan_vace_trainer.py: Trainer class for WAN-VACE. - wan_vace_checkpointing_2_1.py: Checkpointing logic for WAN-VACE. Co-authored-by: martinarroyo <martinarroyo@google.com>

github-actions · 2026-03-11T14:35:44Z

e2e testgrid: https://8bcf50593faf4ea38060e236169827e5-dot-us-central1.composer.googleusercontent.com/dags/maxdiffusion_tpu_e2e/grid

entrpn · 2026-03-12T20:59:25Z

src/maxdiffusion/configs/base_wan_14b.yml

 # Output directory
 # Create a GCS bucket, e.g. my-maxtext-outputs and set this to "gs://my-maxtext-outputs/"
 base_output_directory: ""
+tensorboard_dir: ""


tensorboard_dir is created automatically inside the pyconfig. Is there a reason it needs to be in the config?

You're right, it's indeed no longer needed. It was necessary in an older version of the code, but it's redundant now. I've removed it in the latest commit

entrpn · 2026-03-12T21:03:46Z

As this is a fairly large refactor:

@prishajain1 can you do a review of the checkpointing changes?
@susanbao can you take a quick look at the training changes?

prishajain1 · 2026-03-13T13:51:38Z

src/maxdiffusion/checkpointing/wan_checkpointer_2_1.py

+
    params_restore = ocp.args.PyTreeRestore(
        restore_args=jax.tree.map(
            lambda _: ocp.RestoreArgs(restore_type=np.ndarray),


Passing restore_type = np.ndarray causes the JAX sharding applied above to be redundant. (JAX sharding cannot work on np.ndarrays). Suggest to make it jax.Array to ensure checkpoint is loaded on host in sharded manner if that's intended

prishajain1 · 2026-03-13T16:46:21Z

src/maxdiffusion/checkpointing/wan_checkpointer_2_1.py

-    abstract_tree_structure_params = jax.tree_util.tree_map(ocp.utils.to_shape_dtype_struct, transformer_metadata)
+    state = metadatas.wan_state
+
+    def add_sharding_to_struct(leaf_struct, sharding):


A safer way to do this to prevent unexpected crashes (for any elements not having shape/dtype attributes):

def add_sharding_to_struct(leaf_struct, sharding): struct = ocp.utils.to_shape_dtype_struct(leaf_struct) if hasattr(struct, "shape") and hasattr(struct, "dtype"): return jax.ShapeDtypeStruct( shape=struct.shape, dtype=struct.dtype, sharding=sharding ) return struct

- Address review comments: removed tensorboard_dir because it is created automatically inside the pyconfig. Co-authored-by: martinarroyo <martinarroyo@google.com>

ninatu and others added 9 commits March 11, 2026 14:22

Update wan configs for training

fb25b23

- Ensure `adam_weight_decay` is a float. - Add `tensorboard_dir` parameter for logging. Co-authored-by: martinarroyo <martinarroyo@google.com>

Wan training: use learning rate from config

1fe4ce0

Replaces the hardcoded learning rate in the optimizer creation with the value from `config.learning_rate`. Co-authored-by: martinarroyo <martinarroyo@google.com>

Fix: Ensure prepare_sample_fn is used for 'tfrecord' dataset type

5c6f65f

Co-authored-by: martinarroyo <martinarroyo@google.com>

Wan training: Set default dropout to 0.0 in Wan configs

6101386

Co-authored-by: martinarroyo <martinarroyo@google.com>

Wan 2.1 training: Resolve checkpoint loading issues with larger TPU s…

efbc91d

…lices and different topologies Co-authored-by: martinarroyo <martinarroyo@google.com>

ninatu requested a review from entrpn as a code owner March 11, 2026 14:35

entrpn reviewed Mar 12, 2026

View reviewed changes

prishajain1 reviewed Mar 13, 2026

View reviewed changes

Fix: remove unnecessary tensorboard_dir from wan configs

dec1690

- Address review comments: removed tensorboard_dir because it is created automatically inside the pyconfig. Co-authored-by: martinarroyo <martinarroyo@google.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Wan Model Training & Add Wan-VACE Training Support#352

Refactor Wan Model Training & Add Wan-VACE Training Support#352
ninatu wants to merge 10 commits intomainfrom
ninatu/wan_training

ninatu commented Mar 11, 2026

Uh oh!

github-actions bot commented Mar 11, 2026

Uh oh!

entrpn Mar 12, 2026

Uh oh!

ninatu Mar 16, 2026

Uh oh!

entrpn commented Mar 12, 2026

Uh oh!

prishajain1 Mar 13, 2026

Uh oh!

prishajain1 Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ninatu commented Mar 11, 2026

Uh oh!

github-actions bot commented Mar 11, 2026

Uh oh!

entrpn Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

ninatu Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

entrpn commented Mar 12, 2026

Uh oh!

prishajain1 Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

prishajain1 Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants